Summary

Rough feature annotation

  • Manually subdivide genome into: “Hat”, “Internal” and “External” genes.
  • Genome bubble is hollow.
  • “Internal” genes are below the average of distances from bubble’s centroid.

Promoter-site histone modification signal preprocessing

  • Selected 9 signals that are mostly populated: H3K4, H3K6, H3NTerm, H4NTerm, H2A, H2B, H3, H4, H2AZ.
  • Filled-in the rest as median of the closest 3 histone-profile-space vectors.

Histone communities

  • Extracted all genes with CommunityScore > 2. Read: “twice as distant from all other genes than local group of 5”.
  • x-means to get 7 histone-profile clusters.
  • knn (k=3) to classify all genes in each of the 7 classes.
  • Measuring entropy on a sliding window of 25 bp.
  • Distribution of entropy of sliding windows is shifted compared to that of random sets of 25.
  • Distribution of window-entropy shows a small peak at p-value < 0.1%.
  • We filter all windows by that threshold (corresponding to 1.28 bits of entropy).
  • The vast majority of their genes lies within the “Internal” rough feature.
  • Resulting segments (named by chromosome): II, IV-A, IV-B, VII-A, VII-B, X, XI-A, XI-B, XIII, XIV-A, XIV-B, XVI.

A few segments have significant gene enrichments:

Community Term
II acetyl-CoA transmembrane transporter activity
IV-A Factor: Gal4p, Factor: Gal4p, Factor: SWI4P
VII-B Factor: DAL81
X Factor: Mbp1p, Factor: ARG

Sphere test

  • Sample random spheres of set radius. Center x,y,z drawn from uniform distribution.
  • Each sphere contains 50-150 genes.
  • Compute a different statistic per signal.
  • Calculate p-value by comparing against 10000 random samples of the same gene count.
  • Adjust p-values and filter for an FDR of 1% or 5% depending on the signal.
  • Glue overlapping spheres together. Stop at 5% overlap, which we consider distinct.

Promoter-site histone profile

Statistic: average pairwise histone-space Euclidean distance.

Geometry

  • 7 clusters, completely distinct (no need to apply the 5% rule).
  • All clusters are in the bubble region.
  • They appear on one side of the bubble, penetrating the whole side, from surface to the hollow core.
  • They never extend to the other side of the bubble.

Histone profile

  • The distribution of each histone variant is almost identical to the general distribution.
  • What changes is a subtle tendency of over- or under-representing a histone variant.
  • Albeit shifted, the cluster distribution remains within the general distribution because of its reduced variance.
  • Statistical significance for each cluster is achieved by the entire group of genes moving to the same over/under direction for all histone variants.
  • If we plot the distribution of means of all clusters and all histone variants, we get a multimodal distribution.
  • The distribution of means has zero density at the mean of the general population.
  • All 7 clusters occupy a different spot in the over/under-representation space.

Gene ontology enrichment

Only the larger clusters produce enrichments:

Term PAdj
D Terpenoid backbone biosynthesis 4%
E Ribosomal small subunit binding 5%
F Factor: HSF; motif: AGAANAGAANAGAAN 2%
G Factor: HSF; motif: NTTCTAGAANAGAAN 1%

Relationship to SSD & WGD

  • Without exception all 7 clusters under-represent duplicated genes (enrichments of ~0.7)
  • Cluster D is the extreme case where we get enrichment ~0.5 for both SSD & WGD

Transcription factor binding motifs

Statistic: Average pairwise Jaccard distance in motif presence / absence in a sphere. Double-tailed p-value. So these are both spheres of lower and greater than normal Jaccard distance.

Geometry

  • 4 small clusters.
  • 3 of them are found at the centromeric pole.
  • The 4th, which is the smaller one, takes up a quarter of the “Hat” region.

Gene ontology enrichment

Cluster A produces a rich g:profiler result because of 4 “special” "genes it contains: YLR160C, YLR158C, YLR155C, YLR157C.

Term PAdj TF motif enrichment
A Cyanoamino acid metabolism 0.001%
B ATPase-coupled transmembrane transporter activity 2% Factor: MAC1; motif: TTTGCTCAM
C Regulation of attachment of spindle microtubules to kinetochore 0.9%
D Factor: GCR2; motif: GCTTCCN

Pairwise gene coexpression score

Coexpression is highly correlated with proximity. Roughly half the genome exhibits strong (PAdj < 1%) local coexpression significance.

Geometry

  • The entire genome is covered by these clusters.
  • Two small coexpression clusters appear at different sectors of the Hat region.
  • These sectors do not overlap with the Motif sector (the small Motif cluster appearing on the Hat).
  • There appears to be an insulator splitting the genome in half (“left” and “right” hemispheres).
  • Each hemisphere contains half of the bubble and half of the hat.
  • The small clusters (A and B) are fully separated by the insulator and on the same hemisphere as the two large ones (F and G, respectively).

Gene ontology enrichment

There appear some enrichments in the larger clusters, but these are too generic (“intracellular anatomical structure” and such).

Signals not yielding clusters

These seem to be randomly distributed in 3D space:

  • SSD / WGD / gene duplication in general
  • Number of species where each gene is present, encompassing taxon enrichment.

Field overlap

  • Notable overlaps always involve one of the two major coexpression clusters (F and G).
  • We notice the bipolar appearance of coexpression groups and thus divide the genome in 3+1 continents: Laurasia, Godwana, Antarctica and Tethys.
  • Laurasia in the larger one, sitting on one side of the nucleus.
  • Godwana is the second larger, sitting on the opposite side.
  • Both Laurasia and Godwana encompass part of the Hat.
  • They are separated by the Tethys compartment, which splits the genome in half. Even the Hat has its own schism.
  • A separate “continent” is formed by the centromeres. We call this smaller one “Antarctica”.
  • All 3 continents have a greater average gene coexpression score than random.
  • Tethys (insulator) has a lower average coexpression score than random.

Gene ontology enrichment

  • The Tethys (coexpression insulator) is rich in genes having to do with “Metabolic pathways” (Padj: 2%)

Genes in 3D and rough features

Gene positions

We used the haploid genome model by Duan et al. to extract positions of all Sacharomyces cerevisiae genes.

Why the model can adequately place genes in space.

What do the 3D positions mean

Not a static configuration. Position implies “preferred neighborhood”.

Rough feature annotation

We have observed and annotated three rough regions of the genome:

  • ‘Hat’ genes forming a separate domain (z > 175)
  • “Internal” and “External” genes: the rest of the genes form a hollow bubble. We annotate genes facing inwards as “Internal” and the ones facing outwards as “Exernal”.

Centroid of the “bubble”: 108.823, 71.102, 81.404

SELECT Gene, sqrt((x - 108.823) * (x - 108.823) + (y - 71.102) * (y - 71.102) + (z - 81.404) * (z - 81.404)) AS Radioid FROM Loci WHERE z < 175

It turns out that the bubble is hollow:

Bubble radioids

Average distance from the centroid is: 56.148

CREATE TABLE Labels AS 
SELECT Gene, 
CASE
    WHEN z > 175 THEN "Hat"
    ELSE CASE
        WHEN (x - 108.823) * (x - 108.823) + (y - 71.102) * (y - 71.102) + (z - 81.404) * (z - 81.404) > 3152.644 THEN "External"
        ELSE "Internal"
    END
END Label
FROM Loci

Rough features visualizeed

conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
genePositionsLabels <- dbGetQuery(conn, "SELECT Loci.Gene, x, y, z, Label FROM
Loci JOIN RoughFeatures ON Loci.Gene = RoughFeatures.Gene")
plot_ly(x=genePositionsLabels$x, y=genePositionsLabels$y, z=genePositionsLabels$z, type="scatter3d", mode="markers", text=genePositionsLabels$Gene, color=genePositionsLabels$Label)
dbDisconnect(conn)

Promoter histone modfications and rough features

Histone modifications are agnostic to our structural domain labels.

Histones agnostic to inside vs outside

Preprocessing of histone signals

Original data

The input file consists of rows, each beginning with a gene name and followed by tab-separated real numbers forming the columns. Each column corresponds to a histone modification.

The numbers are in z-score across all columns.

A lot of cells have missing values. I think this means “no reads” or just a few reads. Need to clarify this.

Citation required here: what is the origin of the file?

Selection of histone modifications

During visual inspection of the input file, we notice that a lot of the columns are too sparse.

Sparse histone modifications

Regardless of the way we will choose to handle these gaps, we would need to introduce some assumption and noise to our data. Se we decided to not take these histone modifications into consideration.

We have thus selected the following histone modifications to work with:

  • H3K4
  • H3K6
  • H3NTerm
  • H4NTerm
  • H2A
  • H2B
  • H3
  • H4
  • H2AZ

These still have occassional missing values.

Occassional missing values

Patching the missing values

Consider each gene as a vector of 9 dimensions. We patch the missing value using the following rule: * Find the 3 closest vectors (Euclidean distance) which happen to have the value populated. * Use the median of the 3.

The advantages of this method are:

  • We don’t introduce arbitrary values.
  • Patching takes into account the entire data set.
  • Continuity in one- or three-dimensional space is preserved.

Patched data: Patched histones

Histone distance maps

Distance maps

We visualize the distance (as a heatmap), between each gene and every other gene.

All chromosomes yield a similar picture.

Evaluating the distance maps

The black diagonal corresponds to a zero distance (all genes have a zero distance to themselves).

The first thing one notices in these pictures are the distinct red parallel lines (horizontal and vertical). A red line marks a gene with a particular histone profile, which is relatively unique (it differs from all other genes), thus the distance from all the other genes is significant.

The interesting thing though is that we get these unique profiles in clusters in the linear chromosome. This manifests as a black rectangle in the intersection of two thick red lines:

Community in distance map

Histone communities

We will call these local groups “Histone communities”, as they consist of genes with similar and unique histone modifications that converge together in space.

Histone communities

Community score

We have calculated the “community score” for each gene.

The score is defined as: A / N, where:

  • A is the average Euclidean distance (in histone space) from the gene to all other genes of the same chromosome.
  • N is the average distance of the genes of the local group of 5 genes (+/- 2 genes from the genes we are examining).

We bias the denominator to be 0.17 as a minimum. This corresponds to average of all N minus 3 standard deviations, so as to prevent divisions by zero. The 3 major spikes that can be seen in the picture below are such instances.

Filtering by community score

We consider a value of ‘2’ as significant. In other words, a gene being twice as similar as their neighbours than its similarity to all other genes in the chromosome is considered a “community center” gene. Here is a hierarchical clustering of the histone modifications of community centers:

Deciding number of clusters

We have chosen the community center genes in such a way that we expect clear-cut clusters to be formed in histone space. Plotting the within-cluster total sum of squares we expect to see a clear-cut elbow.

x-means

It is not easy to decide where the elbow is. I would say it is on 3 clusters, but there’s a lot more variance to be explained below that. Searching for a less ambiguous method, I stumbled upon x-means.

x-means clustering algorithm outputs not only the k clusters, but also k itself. It does so by optimizing an objective function - the Bayesian information criterion - which reveals the mosto likely k. Experiments by the authors have shown it to reveal the true model underlying the data.

bicOptimizedClusters <- Xmeans(centers, kmax=10)
Cluster <- bicOptimizedClusters$cluster
centers <- cbind(centers, Cluster)
Cluster
##   [1] 6 7 6 1 7 2 3 4 4 4 4 6 7 7 7 6 2 2 3 6 7 4 6 5 6 4 7 6 4 4 4 7 7 5 5 5 5
##  [38] 6 4 4 6 4 6 5 5 3 3 3 6 6 3 3 6 7 7 7 7 4 4 3 1 7 6 6 6 6 4 6 6 6 4 6 4 7
##  [75] 7 7 7 5 5 7 7 6 4 6 7 7 7 2 6 2 7 7 5 2 4 4 4 7 7 7 6 4 7 6 6 6 7 6 2 2 2
## [112] 6 4 4 6 6 6 1 1 4 7 7 7 7 6 6 4 4 4 6 6 3 3 3 3 4 4 4 4 7 4 7 7 7 6 6 2 2
## [149] 2 6 7 6 6 3 3 7 4 7 7 5 7 7 2 7 6 6 7 7 6 4 7 7 6 3 3 6 6 1 6 4 4 7 6 4 7
## [186] 4 2 2 6 6 5 7 7 7 4 7 7 3 3 2 7 7 6
bicOptimizedClusters$size
## [1]  5 16 18 40 12 54 58

x-means comes up with k=7, which explains most of the variance. Beyond 7 clusters it is evudebt that there’s not much to be gained.

Saving the community centers to database in CommunityCenters table, containing Gene name and Community (integer).

# Insert community centers and corresponding cluster ids to db
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
dbExecute(conn, "DROP TABLE IF EXISTS CommunityCenters")
dbExecute(conn, "CREATE TABLE CommunityCenters(Gene TEXT PRIMARY KEY, Community INTEGER)")
for (i in 1:nrow(centers)) {
    sql <- paste("INSERT INTO CommunityCenters(Gene, Community) VALUES('", row.names(centers)[i], "',", centers[i, 10], ")", sep='')
    dbExecute(conn, sql)
}
dbDisconnect(conn)

Community centers in 3D

In this plot we only see the community centers. Each dot represents a gene and implies a group of 5 genes around it, having a similar histone profile.

Subsequently, we are going to classify the entire genome using these centers as a scaffold, in histone space.

Communities in 3D

After classifying all genes using the above scaffold and knn algorithm (k=3), we can plot them in space.

Communities include a central histone-space cluster (aroung zero in all dimensions), which allows us to classify the entire set. In other words, one of the “extremes” is actually the center of histone space, distant from all other community centers.

Communities in 1D

Here is how the communities look in the linear chromosome. Genes are displayed in succession where each gene has a width of 1 pixel:

Communities on linear chromosome

Measuring Shannon entropy

We sample windows of 25 consecutive genes and measure Shannon entropy. Then we sample “clouds” of 25 uniformly random genes (with replacement) and compare the distributions of entropy.

Shannon entropy distributions for consecutive and cloud (random sets of) genes

We used a total of 100000 random samples for both windows and clouds.

Unlike the “cloud” distribution which is normal, the “window” distribution differs both in shape and in its displacement towards smaller values. Its multimodal shape hints a hierarchy of structure, starting from a set of very low-entropy windows (around 1.2 bits of entropy), two more less structured and more abundant sets at 1.5 and 1.7 bits and finally, after the average has been reached, the distribution collapses to normality.

Probably most interesting result is the first step of the staircase. This corresponds to a set of windows having around 1.2 bits of entropy. These account for around 1% of the samples.

The entire bubble lies in an area of low (<0.001) p-value with respect to the cloud distribution. An entropy limit of 1.28 bits corresponds to that p-value. We can filter our windows based on that.

Tight communities

Windows of 25 genes, filtered for a maximum of 1.28 bits of entropy.

2D view

Tight communities on linear chromosome

3D view

Distance distribution

We now have sets of genes having both:

  • Similar histone profiles, and
  • Proximity in the linear chromosome

Does this imply proximity in 3D space?

To answer this, we measure the Euclidean distance in 3D space between “tight community” genes and between random pairs of genes. To remove the advantage of the first group, which is that they are consecutive by definition, we only sample pairs from distinct chromosomes.

Distance distributions of tight communities and random gene pairs. 28000 samples taken for each category.

The resulting p-value is approximately 0.57. Community genes do not appear to be correlated in 3D space. The multiple peaks are expected; each represents the interaction of a pair of communities.

Resulting gene lists

Communities are named using a combination of the chromosome name and the optional suffix of -A or -B, which signifies the first or second cluster found in given chromosome. A given chromosome may contain zero, one or two clusters.

II

YBR203W YBR204C YBR205W YBR206W YBR207W YBR208C YBR209W YBR210W YBR211C YBR212W YBR213W YBR214W YBR215W YBR216C YBR217W YBR218C YBR219C YBR220C YBR221C YBR222C YBR223C YBR225W YBR226C YBR227C YBR228W YBR229C YBR230C YBR231C YBR232C YBR233W YBR234C YBR235W YBR236C YBR237W

IV-A

YDL013W YDL012C YDL011C YDL010W YDL009C YDL008W YDL007W YDL006W YDL005C YDL004W YDL003W YDL002C YDL001W YDR001C YDR002W YDR003W YDR004W YDR005C YDR006C YDR007W YDR008C YDR009W YDR010C YDR011W YDR012W YDR013W YDR014W YDR015C YDR016C YDR017C YDR018C YDR019C

IV-B

YDR078C YDR079W YDR080W YDR081C YDR082W YDR083W YDR084C YDR085C YDR086C YDR087C YDR088C YDR089W YDR090C YDR091C YDR092W YDR093W YDR094W YDR095C YDR096W YDR097C YDR098C YDR099W YDR100W YDR101C YDR102C YDR103W YDR104C YDR105C YDR106W

VII-A

YGL110C YGL109W YGL108C YGL107C YGL106W YGL105W YGL104C YGL103W YGL102C YGL101W YGL100W YGL099W YGL098W YGL097W YGL096W YGL095C YGL094C YGL093W YGL092W YGL091C YGL090W YGL089C YGL088W YGL087C YGL086W YGL085W YGL084C YGL083W YGL082W YGL081W YGL080W YGL079W YGL078C YGL077C YGL076C YGL075C YGL074C YGL073W YGL072C YGL071W

VII-B

YGL004C YGL003C YGL002W YGL001C YGR001C YGR002C YGR003W YGR004W YGR005C YGR006W YGR007W YGR008C YGR009C YGR010W YGR011W YGR012W YGR013W YGR014W YGR015C YGR016W YGR017W YGR018C YGR019W YGR020C YGR021W

X

YJR003C YJR004C YJR005W YJR006W YJR007W YJR008W YJR009C YJR010W YJR011C YJR012C YJR013W YJR014W YJR015W YJR016C YJR017C YJR018W YJR019C YJR020W YJR021C YJR022W YJR023C YJR024C YJR025C YJR026W YJR027W YJR029W YJR028W YJR030C YJR031C YJR032W YJR033C YJR034W YJR035W YJR036C YJR037W YJR038C YJR039W YJR040W YJR041C YJR042W YJR043C YJR044C YJR045C YJR046W YJR047C YJR048W YJR049C YJR050W YJR051W YJR052W YJR053W YJR054W YJR055W YJR056C YJR057W

XI-A

YKL145W YKL144C YKL143W YKL142W YKL141W YKL140W YKL139W YKL138C YKL137W YKL136W YKL135C YKL134C YKL133C YKL132C YKL131W YKL130C YKL129C YKL128C YKL127W YKL126W YKL125W YKL124W YKL123W YKL122C YKL121W

XI-B

YKR010C YKR011C YKR012C YKR013W YKR014C YKR015C YKR016W YKR017C YKR018C YKR019C YKR020W YKR021W YKR022C YKR023W YKR024C YKR025W YKR026C YKR027W YKR028W YKR029C YKR030W YKR031C YKR032W YKR033C YKR034W YKR035C YKR036C YKR037C YKR038C YKR039W YKR040C YKR041W

XIII

YML122C YML121W YML120C YML119W YML118W YML117W YML116W YML115C YML114C YML113W YML112W YML111W YML110C YML109W YML108W YML107C YML106W YML105C YML104C YML103C YML102W YML101C YML100W YML099C YML098W YML097C YML096W YML095C YML094W YML093W YML092C YML091C YML090W YML089C YML088W YML087C YML086C

XIV-A

YNL177C YNL176C YNL175C YNL174W YNL173C YNL172W YNL171C YNL170W YNL169C YNL168C YNL167C YNL166C YNL165W YNL164C YNL163C YNL162W YNL161W YNL160W YNL159C YNL158W YNL157W YNL156C YNL155W YNL154C YNL153C YNL152W YNL151C YNL150W YNL149C YNL148C YNL147W YNL146W YNL145W YNL144C YNL143C YNL142W YNL141W YNL140C YNL139C YNL138W YNL137C YNL136W YNL135C

XIV-B

YNL064C YNL063W YNL062C YNL061W YNL059C YNL058C YNL057W YNL056W YNL055C YNL054W YNL053W YNL052W YNL051W YNL050C YNL049C YNL048W YNL047C YNL046W YNL045W YNL044W YNL043C YNL042W YNL041C YNL040W YNL039W

XVI

YPR041W YPR042C YPR043W YPR044C YPR045C YPR046W YPR047W YPR048W YPR049C YPR050C YPR051W YPR052C YPR053C YPR054W YPR055W YPR056W YPR057W YPR058W YPR059C YPR060C YPR061C YPR062W YPR063C YPR064W YPR065W YPR066W YPR067W YPR068C YPR069C YPR070W

Gene enrichments

In order to establish citeria for enrichment p-values returned by g:profiler, we first input random sets of 30 genes and note the returned p-value (or ‘1’ if nothing was returned):

0.008 0.015 0.021 0.023 0.03 0.041 0.05 1 1 1 1 1 1 1 1 1 1 1 1 1

Our criteria for inclusion of g:profiler results become:

  • An adjusted p-value that is less than 1%, or
  • An adjusted p-value that is less than 5% and more than 5 genes appear in the enrichment category
Community Term
II acetyl-CoA transmembrane transporter activity
IV-A Factor: Gal4p, Factor: Gal4p, Factor: SWI4P
IV-B
VII-A
VII-B Factor: DAL81
X Factor: Mbp1p, Factor: ARG
XI-A
XI-B
XIII
XIV-A
XIV-B
XVI

Sphere test

We shall now describe a generic framework which was utilized throughout the rest of this work, to segment the three-dimensional genome into areas of interest regarding certain local properties of genes and chromatin in general. We will refer to this framework as a “sphere test”.

Sphere sampling

We sample sets of genes using randomly placed spheres of a given radius. A commonly used value for the radius is 15 units. This allows for sampling sets of up to approximately 150 genes. We then set a minimum acceptable size for a sample. Usually this is set to 50 genes. We do so to ensure that our calculations will be performed on a gene set of suitable size to do statistics upon.

Sampling procedure

Gene and chromatin signal statistics

We then calculate a given metric on each set of sampled genes. Properties like histone modification frequencies, presence or absence of transcription factor binding motifs, gene coexpression scores et cetera -collectively referred to as “signals”- are used as inputs to measure a statistical property of the local set of genes collected by each sphere.

Monte-Carlo estimation of a p-value

We then proceed to calculate the same metric for random sets of genes, having the same gene count as the sphere-sample. We calculate a p-value as the ratio of random occurrences that are equal or more extreme to the metric produced by the sphere. Afterwards, we adjust the p-values by applying the Benjamini-Hochberg method.

Filtering by adjusted p-value

We set a limit for the adjusted p-value, usually 1% and filter the sphere-samples below this limit. These sphere-samples are then considered to exhibit a significantly localized aspect of the signal.

Hierarchical clustering of the significant spheres

The resulting sets of genes frequently overlap. We consider overlapping samples as belonging to the same local set of genes. A limit of a minimum 5% overlap is applied. Sets connected with less than 5% of their genes are considered distinct.

We use an iterative approach to clustering. It can be summarized as follows:

  • Identify the pair of clusters that overlaps the most.
  • Merge this pair into a new cluster.
  • Repeat until we reach the 5% limit, at which point we stop.

Hierarchical clustering

Sphere-test data flow diagram

Sphere-test data flow

Histone-space proximity sphere-test

Our first sphere-test measures proximity in histone-space. Histone-space is a 9-dimensional space, where each dimension corresponds to one of our selected promoter-site histone modifications. A point in that space corresponds to a particular histone modification profile.

The metric for this test is the average distance in histone space, defined as follows.

Given a set of genes G, we denote as P the set of all possible unordered pairs of G. A histone distance is defined for a pair as the Euclidean distance in histone-space for the two genes of the pair. Then, the average histone distance in P is our metric for this particular sphere-test.

It follows that, a set of genes scoring low in this metric contains a more consistent histone modification profile than one scoring higher.

We employ a FDR of 1% to determine statistically significant samples, and this result in seven distinct (zero overlaps) local clusters of very coherent histone modification profiles. Total gene count is approximately 1000, which roughly represents 1/6th of the yeast genome.

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN PromoterFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
promoterFields <- dbGetQuery(conn, sql)
plot_ly(x=promoterFields$x, y=promoterFields$y, z=promoterFields$z, type="scatter3d", mode="markers", text=promoterFields$Gene, color=promoterFields$Field, colors=fieldColors)
dbDisconnect(conn)

Internal vs external

Promoter fields pepetrate deep into the nucleus, thus not differentiating between “Internal” and “External” rough features:

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN PromoterFields f ON l.Gene = f.Gene WHERE l.Gene IN (SELECT Gene FROM RoughFeatures WHERE Label='Internal')"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
promoterFields <- dbGetQuery(conn, sql)
plot_ly(x=promoterFields$x, y=promoterFields$y, z=promoterFields$z, type="scatter3d", mode="markers", text=promoterFields$Gene, color=promoterFields$Field, colors=fieldColors)
dbDisconnect(conn)

Histone profiles

We then attempt a pair plot of all 9 histone modifications and their associated classes.


# Disabling because this takes a long time to produce. Add ```{r} to enable again.

conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
histoneProfiles <- dbGetQuery(conn, "SELECT H3K4, H3K36, H3NTerm, H4NTerm, H2A, H2B, H3, H4, H2AZ, Field FROM HistonesPromoterPatched h JOIN PromoterFields p ON h.Gene = p.Gene")
dbDisconnect(conn)


use_python('C:\\Users\\ProteinMetrics\\AppData\\Local\\Programs\\Python\\Python38\\python.exe')
sns <- import('seaborn')
plt <- import('matplotlib.pyplot')
pd <- import('pandas')

plot <- sns$pairplot(r_to_py(histoneProfiles), hue = 'Field')
plot$savefig("images/PromoterFieldsHistonePairPlot.png")

Histone modification profiles pair plot

Calling the histone profiles by absolute value is not possible. All distributions are almost identical to the overall histone modification distribution.

What differs, however, is a gentle displacement of the average in each dimension. The distinguishing characteristic of our clusters is the accumulation of such tendencies in multiple dimensions. This lowers the average histone profile distance to statistically significant levels.

The natural way to represent our data is therefore a quantification of the gentle shifts for each cluster.

For each histone modification, we calculate the average of each group. Then we consider the distribution of the averages and assign a z-score to each group. The 9-vector of z-scores will then characterize the collective tendencies of the group.

colors <- c("#FF0000", "#FF0000", "#000000", "#00FF00", "#00FF00")
colors <- colorRampPalette(colors)(25)

tendencies <- read.table(
  "../../Results/PromoterFieldsTendencies.tsv",
  sep="\t", header=TRUE, row.names=1)
tendenciesScaled <- scale(tendencies)
heatmap.2(tendenciesScaled, col=colors)

Observations:

  • Every cluster occupies its own spot in the space of possible tendencies.
  • The histogram drawn over the color key verifies the deviation behavior. The distribution is bimodal.
  • The 2 smaller clusters (A and B) are similar: they overrepresent H2A, H2B, H3 and H4.
  • The largest cluster, G, is roughly antithetical to the smaller ones (A and B)

The table below summarizes the more prominent tendencies for Over- and Under-representation observed in each cluster:

Over Under
A H3K36, H3, H4
B H2A, H2B H3K4, H3K36, H3NTerm
C H3K4
D (H3NTerm) (H2A)
E H4NTerm, H3, H2AZ
F H2AZ H2A
G H3NTerm, H4NTerm H2B, H4

Tendencies of cluster D have been parenthesized, to show that these are not most dominant among the group. Rest of the table contains most dominant tendencies per histone variant.

Discussion

What makes the clusters statistically significant is not a substantial shift in histone modification profiles. It is a subtle tendency towards over- or under-representation of each histone variant. Statistical significance is therefore explained by the entire gene group’s decision to move towards one or the other side. We didn’t select the groups for this. The statistical test was histone-space compactness. Thus, the groups must be of reduced variance. Let’s add a figure to illustrate that.

Gene ontology enrichment

Only the larger clusters produce enriched terms:

Term PAdj
D Terpenoid backbone biosynthesis 4%
E Ribosomal small subunit binding 5%
F Factor: HSF; motif: AGAANAGAANAGAAN 2%
G Factor: HSF; motif: NTTCTAGAANAGAAN 1%

Motif fields

Same test structure, different metric.

We now measure average pairwise Jaccard distance in presence / absence of 102 transcription factor motifs found in a gene group.

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN MotifFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
motifFields <- dbGetQuery(conn, sql)
plot_ly(x=motifFields$x, y=motifFields$y, z=motifFields$z, type="scatter3d", mode="markers", text=motifFields$Gene, color=motifFields$Field, colors=fieldColors)
dbDisconnect(conn)

ChromHMM overlap with field ‘A’

MotifFields:A overlaps 96.5% with Tethys.

  • PHD1 (Transcriptional activator that enhances pseudohyphal growth)
  • PHO4 ( transcription factor of the myc-family; activates transcription cooperatively with Pho2p in response to phosphate limitation)
  • MSN4 (stress responsive)
  • TEC1 (filamentous growth)
  • D IG1 (inhibitor of pseudohyphal growth, inhibits STE12 )
  • STE12 (invovled in mating and growth)

Jaccard index motif fields

Using Jaccard index instead of distance, we get more significant p-values. Second tail (nwar p-value 1) is still present but amaller). We can now afford to use an FDR of 1% (previously 5%).

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN JaccardIndexMotifFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
motifFields <- dbGetQuery(conn, sql)
plot_ly(x=motifFields$x, y=motifFields$y, z=motifFields$z, type="scatter3d", mode="markers", text=motifFields$Gene, color=motifFields$Field, colors=fieldColors)
dbDisconnect(conn)

Coexpression fields

Same test structure, different metric.

We now measure average pairwise coexpression score within a gene group.

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN CoexFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
coexFields <- dbGetQuery(conn, sql)
plot_ly(x=coexFields$x, y=coexFields$y, z=coexFields$z, type="scatter3d", mode="markers", text=coexFields$Gene, color=coexFields$Field, colors=fieldColors)
dbDisconnect(conn)

Field overlap

Using a python script to calculate minimum and maximum overlaps between all fields. MinimumOverlap = commonGeneCount / max(geneCount) MaximumOverlap = commonGeneCount / min(geneCount)

Keeping pairs with a maximum overlap greater than 0.5:

Field A Field B Common genes Min Overlap Max Overlap
PromoterFields:A CoexFields:G 62 0.04 1.0
PromoterFields:B CoexFields:G 116 0.08 0.97
PromoterFields:C CoexFields:F 120 0.12 1.0
PromoterFields:D CoexFields:F 160 0.16 1.0
PromoterFields:F CoexFields:G 192 0.13 0.89
PromoterFields:G CoexFields:F 218 0.22 0.96
MotifFields:B CoexFields:G 82 0.05 0.85

Genome hemispheres

We observe that only CoexFields have overlaps with other kinds of fields, and fields F and G in particular, which are the larger ones. Taking a closer look at the rendering, we see that coexpression fields occupy different hemispheres of the nucleus. The hemispheres are separated by a sea of “insulator”. This is nicely demonstrated by clusters A and B. They are different sectors of the Hat region, and they face away from each other and towards clusters F and G (respectively) of the bubble. These two sectors of the hat are separated by the same insulator that separates the rest of the nucleus.

We also get a distinct “south pole” made by the centromeres.

Let’s now combine these into a set of 4 clusters, using continent names as a consise indication of geometry:

CREATE TABLE Continents(Gene TEXT, Continent Text)

DROP TABLE IF EXISTS Continents;
CREATE TABLE Continents(Gene TEXT, Continent TEXT);
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Laurasia' FROM CoexFields WHERE Field = 'B' OR Field = 'C' OR Field = 'D' OR Field = 'G';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Godwana' FROM CoexFields WHERE Field = 'A' OR Field = 'F';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Antarctica' FROM CoexFields WHERE Field = 'E';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Tethys' FROM Loci WHERE Gene NOT IN (SELECT Gene FROM Continents);

Here is how it looks:

sql <- "SELECT l.*, Continent FROM Loci l JOIN Continents c ON l.Gene = c.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
continents <- dbGetQuery(conn, sql)
plot_ly(x=continents$x, y=continents$y, z=continents$z, type="scatter3d", mode="markers", text=continents$Gene, color=continents$Continent, colors=fieldColors)
dbDisconnect(conn)

And here are the respective overlaps. We have more fields overlapping now:

Field Continent Common genes Min Overlap Max Overlap
PromoterFields:A Laurasia 62 0.027 1.0
PromoterFields:B Laurasia 116 0.051 0.97
PromoterFields:F Laurasia 192 0.08 0.89
MotifFields:B Laurasia 82 0.04 0.85
—————- ———– ———— ———– ———–
PromoterFields:G Godwana 218 0.21 0.96
PromoterFields:C Godwana 120 0.12 1.0
PromoterFields:D Godwana 160 0.16 1.0
—————- ———– ———— ———– ———–
PromoterFields:E Tethys 98 0.035 0.57
MotifFields:A Tethys 55 0.02 0.96
MotifFields:C Tethys 76 0.03 0.78

Continent statistics

Box plots

10000 samples of pairwise coexpression score for each continent:

continentCoexpression <- read.table(file = '../../Results/ContinentCoexpressionScoreSamples.tsv', sep = '\t', header = TRUE)
boxplot(continentCoexpression, outline=FALSE, las=2)

The important distinction is that of “sea versus land”. Any combination of the 3 land continents will yield the same (better than average) coexpression score:

continentCoexpression <- read.table(file = '../../Results/CombinedContinentCoexpressionScoreSamples.tsv', sep = '\t', header = TRUE)
boxplot(continentCoexpression, outline=FALSE, las=2)

Statistical significance

Compartment p-value
Godwana 1.18e-16
Antarctica 1.03e-25
Laurasia 7.02e-09
Tethys 1.15e-54
Pangaea 7.85e-14

Detailed t-test results

Godwana

t.test(continentCoexpression$Godwana, continentCoexpression$Random)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Godwana and continentCoexpression$Random
## t = 6.4204, df = 19998, p-value = 1.389e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02866381 0.05385619
## sample estimates:
## mean of x mean of y 
##   1.09641   1.05515

Antarctica

t.test(continentCoexpression$Antarctica, continentCoexpression$Random)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Antarctica and continentCoexpression$Random
## t = 7.4618, df = 19969, p-value = 8.886e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03466125 0.05935875
## sample estimates:
## mean of x mean of y 
##   1.10216   1.05515

Laurasia

t.test(continentCoexpression$Laurasia, continentCoexpression$Random)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Laurasia and continentCoexpression$Random
## t = 4.8569, df = 19997, p-value = 1.202e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01865038 0.04388962
## sample estimates:
## mean of x mean of y 
##   1.08642   1.05515

Tethys

t.test(continentCoexpression$Tethys, continentCoexpression$Random)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Tethys and continentCoexpression$Random
## t = -17.27, df = 19958, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1207367 -0.0961233
## sample estimates:
## mean of x mean of y 
##   0.94672   1.05515

Pangaea

t.test(continentCoexpression$Pangaea, continentCoexpression$Random)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Pangaea and continentCoexpression$Random
## t = 5.8854, df = 19988, p-value = 4.033e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02547789 0.05092211
## sample estimates:
## mean of x mean of y 
##   1.09335   1.05515

Land continents are also significantly different with each other

t.test(continentCoexpression$Godwana, continentCoexpression$Pangaea)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Godwana and continentCoexpression$Pangaea
## t = 0.47086, df = 19990, p-value = 0.6377
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.009678111  0.015798111
## sample estimates:
## mean of x mean of y 
##   1.09641   1.09335
t.test(continentCoexpression$Antarctica, continentCoexpression$Pangaea)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Antarctica and continentCoexpression$Pangaea
## t = 1.3822, df = 19926, p-value = 0.1669
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.003683477  0.021303477
## sample estimates:
## mean of x mean of y 
##   1.10216   1.09335
t.test(continentCoexpression$Laurasia, continentCoexpression$Pangaea)
## 
##  Welch Two Sample t-test
## 
## data:  continentCoexpression$Laurasia and continentCoexpression$Pangaea
## t = -1.0644, df = 19993, p-value = 0.2872
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.019691278  0.005831278
## sample estimates:
## mean of x mean of y 
##   1.08642   1.09335

Laurasia more dependent on the others

Note that Pangaea is actually more tightly coexpressed than its subset “Laurasia”. Thus, Laurasian genes have a greater coexpression score with genes in Godwana and Antarctica, while the latter two are tighter than Pangaea.

Antarctica significantly different to Pangaea

Gene ontology enrichment

“Tethys” (coexpression insulator) cluster has a 2% PAdj for “Metabolit pathways”.

TightCommunities are continent-agnostic

These “Internal” gene sets have parts in all continents:

SELECT Chromosome, Continent, COUNT(Continent) FROM TightCommunities JOIN Continents 
ON TightCommunities.Gene = Continents.Gene 
JOIN Loci ON Loci.Gene = TightCommunities.Gene
GROUP BY Chromosome,Continent
Chromosome Continent GeneCount
2 Godwana 34
4 Antarctica 22
4 Laurasia 26
4 Tethys 13
7 Laurasia 21
7 Tethys 44
10 Antarctica 7
10 Laurasia 27
10 Tethys 21
11 Antarctica 22
11 Laurasia 25
11 Tethys 10
13 Godwana 24
13 Tethys 13
14 Laurasia 68
16 Laurasia 18
16 Tethys 12

Continent enrichment in SSD / WGD

Continent GeneCount SSD SsdRatio SsdEnrich WGD WGDRatio WGDEnrich
Antarctica 417.000 65 0.156 1.012 71 0.170 1.015
Godwana 1026.000 159 0.155 1.006 204 0.199 1.185
Laurasia 2266.000 345 0.152 0.988 431 0.190 1.134
Tethys 2787.000 414 0.149 0.964 361 0.130 0.772

Technical

The following diagram summarizes the flow of data, beginning from our inputs (signals) and ending at results.

From sources to results

This image may serve as a guide for reproducing parts of this work. The associated SVG file can be found in r/Chapters/images/FromSourcesToResults.svg and conveniently zoomed at. Also the source graph project is available in Diagrams folder (FromSourcesToResults.graphml).

Database

The light gray frame in the picture above represents the database. We use a single sqlite file as a central data structure. Various scripts and programs, independent from one another, perform the required transformations. This structure makes it easy to add components at will, with the sole requirement that each component respects the central data structure. This provides for a much cleaner namespace. For instance, all of our programs that do sphere tests, define a “Gene” class. Each time the class has different members (histones, TFs etc) and is loaded from different combinations of tables. This lets each particular program to solely focus on the task at hand and avoids complicated data structures and naming schemes. At the same time, all available data is accessible by all programs.

Replication timing

Let’s see how this new dataset looks:

conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
replicationTiming <- dbGetQuery(conn, "SELECT l.Gene, x, y, z, ReplicationTiming FROM Loci l JOIN ReplicationTiming r ON l.Gene = r.Gene")
plot_ly(x=replicationTiming$x, y=replicationTiming$y, z=replicationTiming$z, type="scatter3d", mode="markers", text=replicationTiming$Gene, color=replicationTiming$ReplicationTiming)
dbDisconnect(conn)

Metric used

We use standard deviation of replication timing in a sphere as a metric.

This signal is so smooth that almost all possible spheres have a far below metric than the randomized set of same size.

Maybe it makes more sense to find the rare spheres where smoothness is disrupted:

sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN ReplicationTimingFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
replicationTimingFields <- dbGetQuery(conn, sql)
plot_ly(x=replicationTimingFields$x, y=replicationTimingFields$y, z=replicationTimingFields$z, type="scatter3d", mode="markers", text=replicationTimingFields$Gene, color=replicationTimingFields$Field, colors=fieldColors)
dbDisconnect(conn)

Overlaps

  • ReplicationTimingFields:C has maximum overlap 100% with MotifFields:A (one is strict subset of the other).
  • ReplicationTimingFields:B is 100% within Tethys.

Worries

Field C is the same as MotifFields A, which was again the weird one, having more variance in TFs than expected by chance. These 123 genes are all located in chrXII, from ~300Kb to ~800KB, with a 200Kb gap between ~500-700.

Maybe we have the wrong (index space) coordinates for these genes? That would explain having significantly varied signals for both signal types.

Alternatively, this weird behavior is because of the rDNA that is present there.

sql <- "SELECT Gene, Chromosome, Start,End FROM Loci WHERE Gene IN (SELECT Gene FROM ReplicationTimingFields WHERE Field = 'C') ORDER BY Start"
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
weirdGenes <- dbGetQuery(conn, sql)
dbDisconnect(conn)
weirdGenes
##        Gene Chromosome  Start    End
## 1   YLR091W         12 322297 323179
## 2   YLR092W         12 323544 326226
## 3   YLR093C         12 326513 327416
## 4   YLR094C         12 327730 329239
## 5   YLR095C         12 329677 332116
## 6   YLR096W         12 332590 336034
## 7   YLR097C         12 336231 337266
## 8   YLR098C         12 337527 339474
## 9   YLR099C         12 339744 340929
## 10  YLR099W         12 341325 341589
## 11  YLR100W         12 341810 342854
## 12  YLR101C         12 342566 342962
## 13  YLR102C         12 342970 343768
## 14  YLR103C         12 343989 345942
## 15  YLR107W         12 364116 365331
## 16     snR6         12 366235 366347
## 17  YLR108C         12 366667 368125
## 18  YLR109W         12 368781 369312
## 19  YLR110C         12 369697 370099
## 20  YLR111W         12 370391 370724
## 21  YLR112W         12 370791 371211
## 22  YLR113W         12 371620 372928
## 23  YLR114C         12 374944 377239
## 24  YLR115W         12 377985 380565
## 25  YLR116W         12 380822 382253
## 26  YLR117C         12 382471 384535
## 27  YLR118C         12 384725 385409
## 28  YLR119W         12 385534 386176
## 29  YLR120C         12 386511 388221
## 30  YLR120W         12 388672 388774
## 31  YLR121C         12 388744 390271
## 32  YLR122C         12 390954 391332
## 33  YLR123C         12 391078 391408
## 34  YLR124W         12 391600 391945
## 35  YLR125W         12 393484 393895
## 36  YLR126C         12 394765 395521
## 37  YLR127C         12 395758 398320
## 38  YLR128W         12 398530 399434
## 39  YLR129W         12 399657 402489
## 40  YLR130C         12 402794 404063
## 41  YLR131C         12 404510 406823
## 42  YLR132C         12 407283 408156
## 43  YLR133W         12 408445 410194
## 44  YLR134W         12 410723 412415
## 45  YLR135W         12 413281 415528
## 46  YLR136C         12 415801 416659
## 47  YLR137W         12 417006 418110
## 48  YLR138W         12 418437 421395
## 49  YLR139C         12 421542 423474
## 50  YLR140W         12 423474 423801
## 51  YLR141W         12 423683 424775
## 52  YLR142W         12 425186 426617
## 53  YLR143W         12 427329 429387
## 54  YLR144C         12 429677 432017
## 55  YLR145W         12 432168 432774
## 56  YLR146C         12 432823 433726
## 57  YLR146W         12 433870 434059
## 58  YLR147C         12 434158 434464
## 59  YLR148W         12 434641 437398
## 60  YLR149C         12 437631 439824
## 61  YLR150W         12 440467 441289
## 62  YLR151C         12 441715 442738
## 63  YLR152C         12 442958 444689
## 64  YLR153C         12 445524 447576
## 65  YLR154C         12 447982 448315
## 66    RDN25         12 451786 455182
## 67    RDN37         12 451786 457733
## 68  YLR154W         12 452438 452624
## 69    RDN58         12 455414 455572
## 70    RDN18         12 455933 457733
## 71     RDN5         12 459676 459797
## 72  YLR155C         12 469317 470406
## 73  YLR156W         12 472113 472458
## 74  YLR156C         12 472479 472611
## 75  YLR157C         12 472969 474058
## 76  YLR157W         12 475764 475977
## 77  YLR158C         12 482549 483638
## 78  YLR159W         12 485345 485690
## 79  YLR159C         12 485711 485843
## 80  YLR160C         12 486201 487290
## 81  YLR161W         12 488997 489342
## 82  YLR162W         12 489573 489930
## 83  YLR163C         12 491867 493256
## 84  YLR163W         12 492813 492927
## 85  YLR164W         12 493884 494391
## 86  YLR165C         12 494495 495260
## 87  YLR166C         12 495430 498046
## 88  YLR167W         12 498948 499407
## 89  YLR168C         12 499579 500272
## 90  YLR169W         12 500336 500690
## 91  YLR170C         12 500580 501051
## 92  YLR171W         12 500734 501124
## 93  YLR172C         12 501261 502164
## 94  YLR173W         12 502422 504249
## 95  YLR174W         12 504592 505831
## 96  YLR175W         12 506135 507587
## 97  YLR176C         12 507798 510234
## 98  YLR177W         12 511055 512942
## 99  YLR178C         12 513163 513823
## 100 YLR179C         12 514109 514715
## 101 YLR180W         12 515263 516412
## 102 YLR181C         12 516679 517672
## 103 YLR182W         12 517941 520353
## 104 YLR183C         12 520544 522014
## 105 YLR184W         12 522107 522455
## 106 YLR185W         12 522664 523290
## 107 YLR186W         12 523633 524392
## 108 YLR310C         12 752225 756995
## 109 YLR311C         12 757266 757614
## 110 YLR312C         12 757638 758835
## 111 YLR312W         12 759481 760243
## 112 YLR313C         12 760356 762342
## 113 YLR314C         12 762574 764137
## 114 YLR315W         12 764807 765269
## 115 YLR316C         12 765265 766358
## 116 YLR317W         12 765654 766089
## 117 YLR324W         12 779214 780786
## 118 YLR325C         12 781142 781379
## 119 YLR326W         12 782173 782896
## 120 YLR327C         12 783126 783387
## 121 YLR328W         12 784912 786118
## 122 YLR329W         12 786441 787333
## 123 YLR330W         12 787663 789679